All the core concepts you need to understand before writing code for Experiments 4, 5 & 6
Libraries, imports, array creation
From dict, column/index types
np.power, ** operator
head(), tail(), loc, iloc
What NaN is, how it arises
isnull, fillna, dropna
Before writing any data code, you must import the two powerhouse libraries. These are already installed in your lab environment.
```python
import numpy as np   # array math, element-wise ops
import pandas as pd  # DataFrames & Series
```
N-dimensional arrays, mathematical functions like np.power. The backbone of scientific Python.
Built on top of NumPy. Provides DataFrame (2-D table) and Series (1-D column) — your main tools today.
Always use np and pd as aliases — every textbook, tutorial and StackOverflow answer uses them.
A NumPy array is a grid of same-type values. Unlike a Python list, every operation on it works element-by-element automatically.
```python
a = np.array([2, 3, 4])
b = np.array([1, 2, 3])
print(a + b)   # [3 5 7]  <- adds index-by-index
print(a * b)   # [2 6 12]
print(a ** b)  # [2 9 64] <- element-wise power!
```
An operation between two same-shape arrays is done position by position. No loops needed.
All elements share one data type (int64, float64…). Pandas inherits this for DataFrame columns.
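A quick way to see this single-dtype rule in action: a single float in the input list upcasts the whole array (the `int64` shown is typical for 64-bit platforms; Windows may report `int32`).

```python
import numpy as np

a = np.array([1, 2, 3])
print(a.dtype)   # int64 on most 64-bit platforms

b = np.array([1, 2.5, 3])
print(b.dtype)   # float64 -- one float upcasts the whole array
```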
np.power(x1, x2) raises each element of x1 to the corresponding element of x2.
```python
np.power(x1, x2)
# x1 -> base array | x2 -> exponent array
# result[i] = x1[i] ** x2[i] for every i

bases = np.array([2, 3, 4])
exps = np.array([3, 2, 1])
result = np.power(bases, exps)
print(result)  # [8 9 4]   i.e. 2**3, 3**2, 4**1
```
a ** b is equivalent to np.power(a, b).

A DataFrame is a 2-D labelled table: think of it as an Excel sheet inside Python. The most common way to build one is from a dictionary.
```python
data = {
    'X': [78, 85, 96],
    'Y': [84, 94, 89],
    'Z': [86, 97, 96],
}
df = pd.DataFrame(data)
print(df)
```
|   | X | Y | Z |
|---|---|---|---|
| 0 | 78 | 84 | 86 |
| 1 | 85 | 94 | 97 |
| 2 | 96 | 89 | 96 |
Each key becomes a column name. All lists must be equal length.
Pandas assigns a numeric row index unless you provide custom labels.
For Exp-5 the DataFrame uses letter labels instead of numbers. Pass your list of labels using the index= parameter.
```python
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)
```
When building the dictionary, use np.nan for missing entries. This requires importing NumPy first. It is the standard float sentinel for "no value".
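A minimal sketch of such a dictionary (the names and scores below are made up for illustration, not the actual Exp-5 data):

```python
import numpy as np
import pandas as pd

# np.nan marks the missing entries
exam_data = {
    'name': ['Anu', 'Ben', 'Cara'],
    'score': [12.5, np.nan, 16.0],
}
df = pd.DataFrame(exam_data, index=['a', 'b', 'c'])
print(df['score'].dtype)  # float64 -- NaN forces a float column
```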
These are the two most-used methods for peeking at a DataFrame quickly.
Returns the first n rows. Default is 5 if you omit the argument.
```python
df.head(3)  # rows a, b, c
df.head()   # rows a-e (first 5)
```
Returns the last n rows. Also defaults to 5.
```python
df.tail(3)  # rows h, i, j
df.tail()   # rows f-j (last 5)
```
Think of head/tail like a newspaper — head shows the headline section, tail shows the classifieds at the back. Both give you a slice without changing the original.
Two indexers that look similar but behave differently:
Label-based. Use the actual index labels ('a', 'b'…). End of slice is inclusive.
df.loc['a':'c'] # rows a, b, c ✓
Position-based. Always 0-indexed integers. End of slice is exclusive.
df.iloc[0:3] # positions 0,1,2 ✓
| Indexer | Works with | Slice end | Example |
|---|---|---|---|
| loc | Index labels | Inclusive | df.loc['a':'c'] |
| iloc | Integer position | Exclusive | df.iloc[0:3] |
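To see the inclusive/exclusive difference side by side, here is a small throwaway DataFrame (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'X': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

print(df.loc['a':'c'])  # labels a, b, c -- end label INCLUDED
print(df.iloc[0:3])     # positions 0, 1, 2 -- end position EXCLUDED
# Both return the same three rows here, but for different reasons.
```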
Next up: understanding why missing values appear, how to detect them, and the three strategies to handle them.
NaN = "Not a Number". In Pandas, it represents a missing or undefined value in a floating-point column.
Pandas stores it as the IEEE-754 floating-point NaN (Python's float('nan')). It propagates through arithmetic: NaN + 5 = NaN
Missing survey answers, failed sensor readings, unfilled form fields, data import errors.
Most Pandas statistical functions silently skip NaN (skipna=True by default), which changes counts and averages. You must decide how to handle it before analysis.
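You can verify the propagation behaviour yourself:

```python
import numpy as np

x = np.nan
print(x + 5)         # nan -- NaN propagates through arithmetic
print(x == np.nan)   # False! NaN never equals anything, even itself
print(np.isnan(x))   # True -- the correct way to test for NaN
```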
NaN, None, 0 and the empty string '' can all look "empty", but Pandas treats them differently: isnull() catches only NaN and None (Python's null). Zero and empty strings are valid values.
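A small check that shows the difference:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, None, 0, ''], dtype=object)
print(list(s.isnull()))  # [True, True, False, False]
# NaN and None count as missing; 0 and '' are real values.
```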
Before you fix missing data, you must find it. Pandas gives you several tools:
```python
df.isnull()        # True where NaN, False elsewhere
df.isna()          # alias -- exactly the same
df.notnull()       # inverse: True where value exists
df.isnull().sum()  # count NaNs per column ★ most useful
df.isnull().any()  # True/False per column: any NaN?
```
Returns a boolean DataFrame — same shape, but each cell is True/False. You can chain with .sum() to count per column.
In Exp-5 data, score has 2 NaNs (rows d & h). Running df['score'].isnull().sum() returns 2.
df.fillna(value) substitutes every NaN with a value you choose. It returns a new DataFrame by default.
```python
# Fill with a fixed number
df.fillna(0)

# Fill with column mean (very common in data science)
df['score'].fillna(df['score'].mean())

# Fill forward (use previous row's value)
df.ffill()   # fillna(method='ffill') is deprecated in modern Pandas

# Fill backward (use next row's value)
df.bfill()
```
By default fillna returns a copy. Add inplace=True to modify the original DataFrame directly: df.fillna(0, inplace=True)
df.dropna() removes any row (or column) that contains at least one NaN.
```python
# Drop every row that has ANY NaN
df.dropna()

# Drop only rows where ALL values are NaN
df.dropna(how='all')

# Drop columns instead of rows
df.dropna(axis=1)

# Keep rows that have at least 3 non-NaN values
df.dropna(thresh=3)
```
Only drop when the missing rows are few and random. Dropping too many rows can introduce bias into your analysis.
Fill when you have a sensible substitute (mean, 0, the next known value). Filling preserves dataset size.
| Situation | Best Approach | Method |
|---|---|---|
| No meaningful substitute exists, row is mostly empty | Drop the row | dropna() |
| Numerical column, want to preserve size | Fill with mean/median | fillna(mean) |
| Categorical column (Yes/No) | Fill with mode | fillna(mode[0]) |
| Time-series / ordered data | Forward or backward fill | ffill() / bfill() |
| Replacement value is known (e.g., 0) | Fill with constant | fillna(0) |
Pro tip: Always inspect with isnull().sum() before and after any fill or drop to confirm your operation worked as expected.
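Following that tip, a minimal before-and-after check might look like this (toy data, not the Exp-5 set):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [9.0, np.nan, 20.0, np.nan, 14.5]})

print(df['score'].isnull().sum())  # 2 -- NaNs before filling
df['score'] = df['score'].fillna(df['score'].mean())
print(df['score'].isnull().sum())  # 0 -- all filled with the mean (14.5)
```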
These attributes let you inspect a DataFrame without printing the whole thing:
```python
df.shape       # (rows, cols) -- e.g. (10, 4)
df.dtypes      # data type of each column
df.columns     # list of column names
df.index       # row labels
df.info()      # summary + non-null counts ★
df.describe()  # stats: mean, std, min, max…
```
Shows column names, non-null count, and dtype. The fastest way to spot missing data at a glance.
Gives count, mean, std, min, percentiles, max for numeric columns. NaN rows are excluded from count.
A tuple — first number is rows, second is columns. No parentheses — it's a property, not a method.
```python
import numpy as np
import pandas as pd
```
np.power(x1, x2) — raises each element of x1 to corresponding element of x2
pd.DataFrame(data, index=labels) + df.head(n)
isnull() / isna() / isnull().sum()
fillna(value) to replace | dropna() to remove
🎯 Remember: Understand the concept first — then the code writes itself.
Open the accompanying Pandas.ipynb notebook directly in Google Colab — no installation needed, runs entirely in your browser.
Tip: Sign in with your Google account to save your work in Google Drive.